System and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms

ABSTRACT

A system and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms. The method includes processing a plurality of information obtained from a base calling system and creating a plurality of refined base calls using a plurality of original base calls and a plurality of intrinsic peak characteristics. A quality value is also assigned to each of the plurality of refined base calls using the plurality of intrinsic peak characteristics. Processing comprises detecting a plurality of peaks, expanding the plurality of peaks, and resolving the plurality of expanded peaks. Resolving may include fitting the plurality of expanded peaks using a model of a peak shape. A peak resolution parameter is calculated and used in processing. The system may also be trained.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation of U.S. patent applicationSer. No. 09/658,161, filed Sep. 8, 2000, which is incorporated herein byreference.

BACKGROUND

[0002] A. Field of the Invention This invention relates todeoxyribonucleic acid (DNA) sequencing. More specifically, thisinvention provides a method and system for improving the accuracy of DNAsequencing and error probability estimation through application of amathematical model to the analysis of electropherograms.

[0003] B. General Description Of The Area Of Research

[0004] With the advent of the Human Genome Project and its massiveundertaking to sequence the entire human genome, researchers have beenturning to automated DNA sequencers to process vast amounts of DNAsequence information. DNA, or deoxyribonucleic acid, is one of the mostimportant information-carrying molecules in cells. DNA is composed offour different types of monomers, called nucleotides, which are in turncomposed of bases linked with a sugar and a phosphate group. The fourbases are adenine (A), cytosine (C), guanine (G), and thymine (T). Theoriginal state of a DNA fragment is a double helix of two antiparallelchains with complementary nucleotide sequences. The coded information ofa DNA sequence is determined by the order of the four bases in either ofthese chains.

[0005] A common approach to obtaining information from DNA is the Sangermethod. In this method, single-stranded DNA fragments are used astemplates from which a series of nested subfragment sets is generated.(F. Sanger, et al., “DNA Sequencing With Chain-Terminating Inhibitors”,Proceedings of the National Academy of Sciences of the USA, vol. 74,pp.5463-5467 (1977)). The subfragments start at the same end of thetemplate, and a fraction of the subfragments of each length are causedto terminate by incorporation of chemically modified bases, therebyforming subfragment sets in increments of one nucleotide. In the popular“four-color” method, the terminating bases are labeled by one of fourfluorescent dyes specific to the terminating base type, A, C, G or T.(L. M. Smith, et al., “Fluorescence Detection In Automated DNA SequenceAnalysis”, Nature, vol. 321, pp. 674-679 (1986)). The resulting mixtureof sets of subfragments represents all of the possible sublengths of thetemplate, with each set of subfragments labeled by a fluorescent dyecorresponding to its terminating base type. To determine the sequence ofthe template, the subfragments are sorted by length usingelectrophoresis. In this process shorter subfragments migrate fasterthan longer subfragments in an applied electric field. Becausesubfragments are created in increments of one nucleotide, they passthrough an electrophoretic cell one at a time in the order of thenucleotides in the template. The terminating base types are identifiedby the wavelength at which they fluoresce. A real-time fluorescentdetection of migrating bands of the subfragments is then performed asthe subfragments pass through a detection zone. The light collected isprocessed with a set of spectral filters that attempt to isolate thesignals from the four dyes.

[0006] In automated DNA sequencing, these raw signals are then analyzedby a signal processing software. The steps of signal processing mayinclude downsampling of the data to 1 Hz if necessary, primer dataremoval, baseline adjustment, noise filtering, multicomponenttransformation, dye mobility shift correction, signal normalization,etc. (see, e.g., M. C. Giddings, et al, “A Software System For DataAnalysis In Automated DNA Sequencing”, Genome Research, vol. 8, pp.644-665 (1998)). Processing the raw data produces analyzedelectropherograms with clearly defined peaks. The analyzed data in theform of electropherograms are then processed using a base callingprogram. The base calling program infers a sequence of bases in the DNAfragment. This sequence of bases is also referred to as a read and isusually about 1,000 bases long. Not all of the called bases are used insubsequent processing. The statistically averaged error produced by anybase calling program is usually low, i.e., below 1%, for bases locatednear middle of a read and increases significantly toward the beginningand, especially, toward the end of a read. To characterize a reliable,or high quality part of a read, a threshold of 1% base calling error iscommonly accepted. That is, only that part of the read having an averagebase calling error of 1% or less will be subsequently used.Alternatively, this may be characterized in terms of the quality valuesassigned to bases, where the quality is the measure of reliability ofthe base call. According to a commonly used definition of qualityvalues, a quality value of 20 or higher corresponds to a probability oferror of 1% or less. In practice, when sequencing, the correct sequenceis not known in advance, so reliable predictions of quality values fornewly sequenced fragments based on previous training or calibration on adata set with a known correct sequence are desirable.

[0007] C. Prior Art

[0008] 1. ABI Base Caller

[0009] The ABI Base Caller is a part of DNA Sequencing Analysis softwareproduced by Applied Biosystems of Foster City, Calif. This program takesraw electropherograms as input, processes them to produce analyzedelectropherograms having well defined and evenly spaced peaks, and thendetects and classifies peaks in the analyzed electropherograms as asequence of bases. The program outputs the results to a binary filecalled an ABI sample file. The output includes the raw and analyzedelectropherograms for each of the four traces, the array of called basesand the array of locations assigned to the bases in an electropherogram.The output does not include an estimate of quality values, because theABI Base Caller program does not estimate the reliability of base calls.

[0010] The ABI Base Caller was chronologically the first and is stillone of the best base calling programs available. The base calls producedby the ABI Base Caller, however, are not very accurate toward the end ofa read, where peaks in an analyzed electropherogram become wider andsignificantly overlap. In this part of the read, the ABI Base Callerproduces a considerable amount of mismatch errors, unknown base callsthat are denoted as N's, and overlooks some base calls resulting indeletion errors.

[0011] 2. Phred

[0012] Phred is the first base calling software program to achieve alower error rate than the ABI software, and is especially effective atthe end of a read. Phred takes analyzed electropherograms produced bythe ABI Base Caller as input, calls the bases and assigns quality valuesto the called bases. (see B. Ewing, et al., “Base Calling Of AutomatedSequencer Traces Using Phred. I. Accuracy Assessment”, Genome Research,vol. 8(3), pp. 175-185 (1998); B. Ewing and P. Green, “Base-Calling OfAutomated Sequencer Traces Using Phred. II. Error Probabilities”, GenomeResearch, vol. 8(3), pp. 186-194 (1998)).

[0013] The base calling procedure in Phred consists of four phases:locating the predicted peaks, locating the observed peaks, matching theobserved and predicted peaks, and finding the missed peaks. In the firstphase, Phred attempts to find the idealized locations that all of thebase peaks that would have occurred in the absence of imperfections inthe sequencing reactions, in the electrophoresis process, and in traceprocessing. The underlying premise of Phred is that under such idealizedconditions, each trace consists of evenly spaced, non-overlapping peaks,corresponding to the labeled fragments that terminate at a particularbase in the sequenced strand. To find the positions of predicted peaks,Phred first examines the four trace arrays that correspond to each ofthe four bases to detect the peaks. A detected predicted peak isidentified as the location of the maximum value, or, if the maximum doesnot exist, the midpoint between the inflection points. The processedtrace is then scanned to find the regions of uniform peak spacing andthe average peak period. The average peak period corresponds topeak-to-peak spacing or inter-peak spacing. This is determined for eachof the regions. Phred then uses Fourier methods to find the positions ofthe predicted peaks in between these regions.

[0014] In the second phase, Phred locates all of the observed peaks byscanning the four trace arrays for regions that are concave. Accordingto Phred, the concave part of the trace located between two inflectionpoints is the observed peak. In the third phase, Phred matches observedand predicted peaks by assigning each observed peak to each predictedpeak. This is done via alignment of the two lists of peaks using adynamic programming algorithm. If, upon the completion of the thirdphase, no suitable observed peak can be assigned to a predicted peak,the corresponding base call is defined to be N, meaning that it isunknown or that it is not assigned.

[0015] Once the base calling procedure is completed, Phred assignsquality values to the bases. The quality value is the measure ofreliability of a given base call. If P is the probability that the basecall is incorrect, then the quality value QVis defined by the expressionQV=−10 log P, rounded to the nearest integer. Thus,

[0016] P=10⁻¹ corresponds to QV=10;

[0017] P=10⁻² corresponds to QV=20;

[0018] P=10⁻³ corresponds to QV=30. etc.

[0019] To assign quality values to called bases, Phred trains onestablished data for which the correct base sequence, referred to as aconsensus, is already known. Phred's training software compares theactual base calls with the consensus to determine the positions at whichPhred makes base calling errors. Phred then stores the trace conditionsunder which a particular peak was incorrectly called. Specifically, foreach called base or peak, Phred computes and stores four parameters,peak height ratio in a window of three peaks, peak height ratio in awindow of seven peaks, peak spacing ratio in a window of seven peaks andpeak resolution. These four parameters are referred to as the traceparameters. These parameters are useful in discriminating base callingerrors from correct base calls. For each base, these four parameters areexpressed as functions of characteristics of the corresponding peak, aswell as the characteristics of several other peaks flanking the currentpeak. Smaller trace parameter values correspond to higher qualitymeasurements.

[0020] Based on this stored information, a text file called a lookuptable is generated by Phred's training software. This file storesinformation about average base calling errors that corresponds to eachgiven set of the four trace parameter values. The lookup table isprovided to users of Phred as a part of the source code and is used whenassigning quality values to called bases. Phred's lookup table iscalibrated on data generated by the ABI PRISM® 377 DNA Sequencer(available from Applied Biosystems of Foster City, Calif.). When Phredis run on data produced by the newer generation DNA sequencers (e.g.,ABI PRISM® 3700 DNA Analyzer), the quality values produced are not veryaccurate. Specifically, the quality values determined by Phred are lowerthan those experimentally produced, causing false low qualitydeterminations that result in unused base calls.

[0021] D. Inadequacies of the Prior Art

[0022] Currently used tools for automated DNA sequencing such as ABI'sBase Caller, Phred and others operate with peak characteristics, such asarea, position, height, etc. (see, e.g., A. Berno, “A Graph TheoreticApproach To The Analysis Of DNA Sequencing Data”, Genome Research, vol.6, pp.80-91 (1996); M. C. Giddings, et al., “A Software System For DataAnalysis In Automated DNA Sequencing”, Genome Research, vol. 8, pp.644-665 (1998)). These characteristics are used to discern true peaksfrom noise before performing base calling, as well as to computeimportant trace parameters, which are used to calibrate quality valuesassigned to called bases. To compute the characteristics of a peak of agiven dye color, the signal from the trace of the particular color isused. Because this signal is actually a sum of individual signals fromall peaks of the particular color, the presence or absence of otherpeaks of the same color in close vicinity of a given peak significantlyaffects the characteristics computed for a given peak. For example, theapparent height of a given peak computed as the total signal at thepeak's position would generally overestimate the true, or intrinsicheight of this peak. This occurs because other peaks of the same colorwhich overlap with the current peak amplify the signal at the currentpeak's position. These characteristics, therefore, are not very accurateand should be considered as “apparent”. The phenomenon of peakoverlapping is particularly noticeable toward the end of a trace, whichcorresponds to the end of a fragment. Toward the end of a trace, peaksbecome wider and overlap, so that the difference between the apparentand intrinsic peak characteristics becomes significant. And it is towardthe end of a trace where the currently used, prior art base callingsystems fail to provide accurate results.

[0023]FIG. 1 (prior art) illustrates a sample electropherogram showinginflection points, apparent peaks and apparent peak positions asidentified by Phred. Phred uses inflection points to detect peaks whichare more accurately described as apparent peaks. A peak is defined asthe concave part of an electropherogram, E, located between twoinflection points. Inflection points i₁ through i₈ are used to identifypeaks P₁, P₂, P₃ and P₄. Only the shaded areas in FIG. 1 are regarded asthe peak area and are used in subsequent processing. The rest of theelectropherogram, where the signal may still be high, is ignored byPhred, therefore ignoring this useful information. This may be aconsiderable loss if two or more peaks overlap, such as with peaks P₂,P₃ and P₄.

[0024] The apparent peak characteristics are computed in Phred in thefollowing way. Apparent peak area is the area below the apparent peak.Apparent peak position is the position in an electropherogram whichbisects the apparent peak's area and is shown as POS₁, POS₂, POS₃ andPOS₄. Apparent peak height is the signal at the apparent peak position.Apparent peak spacing is the distance between the positions of oneapparent peak and a previous apparent peak in an electropherogram.

[0025] Partially due to inadequate processing of traces and peaks,toward the end of a trace, the commonly used base calling programs, suchas Phred and the ABI Base Caller, produce a considerable number ofunknown base calls designated as N. This happens, for example, when twoor more good peak candidates for base calling are found at a givenposition in the trace or, conversely, when no peaks are found near theposition where a base call should be made. Because Ns are not present inthe correct sequence, they may be regarded as mismatch errors. From apractical point of view, it is desirable to re-call Ns to a best guessbase call and express the uncertainty associated with the base throughan assigned quality value.

[0026] Another form of inadequate trace processing results in missedbases, or deletion errors, produced, particularly, by the ABI BaseCaller towards the end of read. This may occur when two or moreoverlapped peaks are mistakenly identified and called as one. To avoidthis type of error, Phred splits excessively wide peaks into narrowerpeaks. However, Phred's peak splitting procedure may be inaccurate andmay result in undesired consequences which include the splitting andsubsequent calling of noise peaks, such as dye blobs, the peakscorresponding to a pure dye, rather than to any DNA subfragment.

[0027] Inadequate processing of peaks not only affects base calling, butmay also result in assigning inaccurate quality values to bases. Theapparent peak characteristics and apparent trace parameters that areused by prior art systems such as Phred to assign a quality value to acalled peak depend significantly on the peak's environment. That is,whether there are other peaks of the same dye color in a close vicinityof the current peak. Because the environment of an average peak may varysignificantly from one dataset to another, quality values assigned tobases of one dataset based on a quality value calibration derived fromanother dataset may produce inaccurate results. Specifically, Phred,which is calibrated using the data produced by ABI PRISMS® 377 DNASequencers, may underestimate quality values of bases generated by theABI PRISM® 3700 DNA Analyzer.

[0028] Even if proper calibration is used, Phred usually assigns lowquality values to unresolved bases. Unresolved bases are typicallypresent near the end of a read. A base is unresolved if it is called asN or if for at least one of its neighboring bases, there is no pointbetween two corresponding peaks at which the signal is less than thesignal at each peak. Thus, with an improved base calling and peakprocessing procedure that includes the re-calling of Ns and resolving ofunresolved bases, higher and more accurate quality values may beassigned to bases. This improved processing may also produce more usefuldata by increasing the number of bases with sufficiently high qualityvalues, such as 20 or better.

BRIEF SUMMARY OF THE INVENTION

[0029] The present invention provides a novel approach to peakprocessing using intrinsic peak characteristics in computations relatedto base calling and quality value calibration instead of the apparentcharacteristics that are used in prior art systems. The use of intrinsiccharacteristics allows for more accurate base calls and quality valueassignments. The method of the present invention is based on theapplication of a mathematical model of peak electrophoresis to discerntrue peaks from noise and to process rows of overlapping peaks of agiven color. The processing of the present invention includes threesteps: peak detection, peak expansion and peak resolution. The firststep employs inflection points to find individual peaks. The peakboundaries are then expanded by scanning the trace to the left from theleft boundary of each peak and scanning the trace to the right from theright boundary of each peak. Finally, the groups of overlapping peaksare resolved into individual intrinsic peaks. For single peaks, theresolution step consists of simply fitting the data peak with the model.For multiple peaks, resolution requires a more sophisticated iterativeprocedure.

[0030] The mathematical model of peak electrophoresis is further used todiscern true peaks from noise. The expected shape of a current peak iscomputed as an average shape of a group of previous good peaks that wererefined according to the model. By comparing an expected shape of acurrent peak with the shape of a current peak, a determination is madewhether the current peak is a true peak or is noise so that only truepeaks are called.

[0031] The present invention also resolves wide peaks into descendentpeaks more accurately than prior art systems. If the width of a currentpeak is greater than a system specified expected width, the peak will beresolved into two or more intrinsic peaks having model peakcharacteristics. Certain adjustable parameters of these descendent peaksare selected so as to fit the observed data. In this way, the sum of thesignals of the multiple superimposed intrinsic peaks are approximatelyequal to the apparent signal of the original parent peak.

[0032] The present invention increases the usefulness of DNA data bycalling bases which are unassigned and unassignable by prior artsystems. By re-calling originally unknown bases, inserting new bases andresolving overlapping peaks, the useful read length of bases with highquality values is extended.

BRIEF DESCRIPTION OF THE DRAWINGS

[0033]FIG. 1 (prior art) depicts a sample electropherogram showinginflection points, apparent peaks, and apparent peak position asidentified by Phred

[0034]FIG. 2 depicts a hardware environment in which the method andsystem of the present invention may be implemented.

[0035]FIG. 3 depicts a general flow of actions taken according to oneembodiment of the method and system of the present invention.

[0036]FIG. 4 depicts a flow of actions taken to detect peaks accordingto one embodiment of the method and system of the present invention.

[0037]FIG. 5 depicts a sample electropherogram showing peaks andexpanded peaks.

[0038]FIGS. 6A and 6B depict a flow of actions taken to expand peaksaccording to one embodiment of the method and system of the presentinvention.

[0039]FIG. 7 depicts a sample electropherogram showing apparent peaksand intrinsic peaks.

[0040]FIG. 8 depicts a flow of actions taken to refine original basecalls according to one embodiment of the method and system of thepresent invention.

[0041]FIG. 9 depicts a flow of actions taken to refine originally calledand known bases according to one embodiment of the method and system ofthe present invention.

[0042]FIG. 10 depicts a flow of actions taken to refine base callsdesignated as unknown or N according to one embodiment of the method andsystem of the present invention.

[0043]FIG. 11 depicts a flow of actions taken to call uncalled peaksaccording to one embodiment of the method and system of the presentinvention.

DETAILED DESCRIPTION

[0044] A. Overview of the Invention

[0045] The present invention provides a novel approach to peakprocessing. This approach uses the intrinsic peak characteristics incomputations related to base calling and quality value calibrationinstead of the apparent characteristics that are used in prior artsystems. The use of intrinsic characteristics allows for more accuratebase calls and quality value assignments.

[0046] Another aspect of the present invention is a method for computingintrinsic peak characteristics. The method of the present invention isbased on application of a mathematical model of a peak electrophoresis,which results in an expression for the peak shape, to processing of rowsof overlapping peaks of a given color. The processing of the presentinvention includes three steps: peak detection, peak expansion and peakresolution. The first step employs inflection points to find individualpeaks. The peak boundaries are then expanded by scanning the trace tothe left from the left boundary of each peak and scanning the trace tothe right from the right boundary of each peak. The purpose of peakexpansion is to include the entire electropherogram with nonzerointensity of signal and to find all groups or rows of overlapping peaksin each trace. The number of peaks in the group may vary from one, for asingle peak, to more than ten, which is characteristic of the signaltoward the end of trace, where peaks are wide and may overlap. Finally,the groups of overlapping peaks are resolved into individual intrinsicpeaks superimposed over the overlapping peaks. For single peaks, theresolution step consists of simply fitting the data peak with the model.For multiple peaks with more than one peak in the group, resolutionrequires a more sophisticated iterative procedure which is discussed inmore detail below.

[0047] Yet another aspect of the present invention is the use of themathematical model of a peak electrophoresis to discern true peaks fromnoise. The present invention includes a simple model that takes intoaccount the two basic physical processes involved in electrophoresis:the electromigration of DNA fragments in an applied electric field andtheir Brownian diffusion. Electromigration is responsible for thetranslation of a peak without changing its shape, while diffusionresults in evolution of the peak shape. This model assumes that,initially, each peak has a rectangular shape with a finite width andpredicts that the peak shape at any time is a function of threeadjustable parameters. Two of these parameters are locally invariant.That is, these two parameters have nearly identical values for peaksthat are close neighbors in the trace. Thus, the expected values ofthese parameters, and hence the expected shape of a current peak, can bedetermined by scanning the trace toward the beginning of the trace froma current peak and finding and fitting a few closely distanced goodpeaks. By comparing the expected shape of the current peak, computed asan average shape of a few previous good peaks, with the actual currentpeak's shape, the present invention determines whether the current peakis a true peak or noise. This determination is used when calling peaks.

[0048] Phred also uses average characteristics of a few previous peaksto determine or correct some characteristics of a current peak. Morespecifically, Phred compares the width of a current peak, defined as thedistance between a couple of inflection points, with the average widthof a few previous peaks. If the current peak is found to be wider than asystem-specified width, the peak is split into two or more narrowerpeaks by bisecting, trisecting, etc. the peak. However, as discussedabove, the peak width, as defined in Phred, is an apparentcharacteristic, which does not correspond to a physical, realcharacteristic of a peak. This is particularly important when peaksoverlap. As such, the peak splitting procedure implemented in Phred isrelatively crude and inaccurate. In particular, Phred is not capable ofrecognizing dye peaks, which correspond to pure dye rather than to DNAsubfragment. Dye peaks have a characteristic shape and should not becalled. Phred may processes dye peaks as regular wide peaks, bysplitting them and subsequently calling the descendent peaks. The peakprocessing system of the present invention, bases a determination of theshape of a current peak on a comparison with the shape expected from themodel and resolves complex peaks into individual peaks in a more naturalway. According to the method of the present invention, if the currentpeak's width is found to be greater than a system specified expectedwidth, the peak will be resolved with two or more intrinsic peaks that,by definition, have model peak characteristics. The adjustableparameters of these descendent peaks, including their heights, widthsand positions, are selected so as to fit the observed data. In this way,the sum of the signals of the superimposed intrinsic peaks areapproximately equal to the apparent signal of the original, currentpeak.

[0049] The method of the present invention increases the usefulness ofDNA data. Bases which are unassigned and unassignable by prior artsystems are called. In particular, when refining Ns produced by eitherthe ABI Base Caller or by Phred, the traces in which the originallocations of N calls fall into uncalled peaks are considered, and, amongsuch traces, those with the highest intrinsic peaks are selected. Thepresent invention also increases the usefulness of DNA data by insertingbases missed during an original base call analysis. As discussed above,the ABI Base Caller typically produces deletion errors toward the end ofa read. As an improvement, the method of the present invention includesscanning the list of all peaks to find good peaks that were not calledby the ABI Base Caller and inserting corresponding bases into the calledsequence. When selecting appropriate candidate peaks, the presentinvention uses intrinsic peak characteristics. More specifically, anuncalled peak is considered a candidate for insertion if it is nottruncated and its intrinsic height is greater than the intrinsic signalof any other peak at the candidate peak's position.

[0050] Further, the method of the present invention modifies thedefinitions for three trace parameters used in prior art systems, suchas Phred, to calibrate quality values. To compute these parameters, thepresent invention uses intrinsic peak heights and positions instead ofapparent values used by prior art systems such as Phred. The method ofthe present invention also introduces a new trace parameter, peakresolution, which is entirely different from a parameter of the samename used by Phred.

[0051] Because of the way Phred defines its fourth parameter, Phredtypically assigns low quality values to unresolved bases and the basesclose to unresolved bases in the read, which may result in false errorvalues. A base is unresolved or unknown if it is called as N or if forat least one of its neighboring bases, there is no point between twocorresponding peaks at which the signal is less than the signal at eachpeak. The method of the current invention, however, does not leaveunresolved bases. The method of the current invention, re-calls Ns andresolves overlapping peaks. This significantly extends the useful readlength estimated as a number of bases with sufficiently high qualityvalues, such as 20 or better, in a read.

[0052] B. A Hardware Environment

[0053]FIG. 2 depicts a hardware environment in which an embodiment ofthe method and system of the present invention may be implemented. Inthis embodiment, the method is implemented as software stored in andexecuted by a computer such as computer 10. Computer 10 may be anycomputer that can execute the software programs that receive data frombase calling system 15 that includes DNA analyzing software which isexecuting on a computer. Base calling system 15 receiveselectropherogram data, also known as raw data, from a DNA sequencer,such as DNA sequencer 20. DNA sequencer may be any DNA sequencer oranalyzer such as, for example, the ABI PRISM® 377 DNA Sequencer or ABIPRISM® 3700 Analyzer available from Applied Biosystems of Foster City,Calif. Base calling system 15 may be any base calling system, such as,for example, ABI PRISM® Sequencing Analysis Software version 3.4available from Applied Biosystems of Foster City, Calif., which includesthe ABI Base Caller as one of its components, and is referred to hereinas the ABI Base Caller. In one embodiment, one or more DNA sequencers 20may provide raw data to one base calling system 15. DNA sequencers 20receive DNA samples 24, processes the DNA sample, and send informationin the form of electropherogram data, also referred herein to as a DNAsample file, to base calling system 15. Base calling system 15 sendsinformation to and may receive information from computer 10 viacommunications controller 30. Communications controller 30 and basecalling system 15 must share a well known communications protocol, suchas, for example, serial, parallel, and any variations thereof. Inanother embodiment, a plurality of DNA sequencers may be connected to abase calling system over a computer network. In this embodiment,computer 10 may communicate with base calling system 15 via a computernetwork (not shown).

[0054] In one embodiment, computer 10 comprises processing unit 12,display device 14, and one or more input devices, such as keyboard 16and mouse 18. In this embodiment, processing unit comprises processor 40and memory 42. Processor 42 may be any computer processor, and memory 40may be any random access memory (RAM) or other readable and writeablememory device. Processor 40 executes the software that implements themethod of the present invention utilizing memory 42. Information,including software that implements the method of the present invention,DNA sample files, etc. are read from and written to storage device 32which is coupled to storage device controller 34. Storage device 32 maybe a hard disk drive, a readable and writeable compact disk (CDRW)drive, a floppy disk drive, etc. Storage device 32 may be any device bywhich a machine may read from a machine readable medium such as thedevices already mentioned, as well as, but not limited to, a stick orcard memory device, a digital audio tape (DAT) reader, etc. In oneembodiment, storage device 32 may be a plurality of disk drivescomprising a disk array or other configuration. The processor maycommunicate instructions to display controller 36 to display images ondisplay device 14. Display controller 36 may be any display controller,and display device 14 may be any display monitor, including, but notlimited to, a cathode ray tube (CRT) display monitor and a thin filmtransistor (TFT) display screen. A system user may access computer 10via any computer input device, such as, for example, keyboard 16 andmouse 18 which are coupled to the processor by I/O controller 38.

[0055] Processor 40, memory 42, storage device controller 34, displaycontroller 36, I/O controller 38, and communications controller 30, arecoupled to one another via and communicate with one another over bus 44.Bus 44 may be any bus that provides for communication of and betweencomponents within a computer. Although only one bus is depicted,multiple buses may be used in computer 10. In addition, other componentsand controllers (not depicted) or multiple instances of depictedcomponents and controllers may be included in computer 10.

[0056] The present invention reads input data from DNA sample filesstored on the storage device in either ABI format or SCF format. EachDNA sample file contains information about one DNA fragment and isapproximately 1000 bases long. More specifically, the present inventionreads a plurality of information from DNA sample files. First, the totalnumber of called bases in the DNA fragment and the sequence of thesebases is read. This sequence is referred to as original base calls andmay be determined by any third party's base calling system, such as, forexample, the ABI Base Caller. Second, the locations of the original basecalls in the electropherogram, which may be referred to as calledlocations, are read. Last, the number of data points in each of fourelectropherograms and four analyzed electropherograms, which arerepresented in the form of integer arrays, are also read.

[0057] The present invention improves the base calls and locations readfrom the sample file and assigns a quality value to each base call. Theresults are then output in one or more output files in one or more of avariety of common formats compatible with prior art systems. A file in afirst output format, the Phred compatible .phd file, consists of aheader followed by three columns of data that contain the base calls,assigned quality values, and base locations. A file in a second outputformat, the .qual file, consists of a header followed by the qualityvalues assigned to the called bases, separated by spaces. A file in athird output format, the .seq file, consists of a header followed by asequence of called bases. A file in a fourth output format, the .qrfile, consists of data used to plot histograms of read lengthdistribution for a given set of fragments. The data in the fourth outputfile format are comprised of two columns of information. The firstcolumn represents a bucket of read lengths, and the second columnincludes the number of fragments read having a quality value of 20 ormore for a given bucket.

[0058] C. A Method

[0059]FIG. 3 depicts a general flow of actions taken according to oneembodiment of the method and system of the present invention. Thepresent invention comprises three basic steps: peak processing, as shownin block 110, refining base calls, as shown in block 120, and qualityvalue assignment, as shown in block 130. In one embodiment, peakprocessing may include detecting peaks, as shown in block 112, expandingpeaks, as shown in block 114, and resolving multiple peaks, as shown inblock 116. In one embodiment, refining base calls may include matchingoriginal base calls with peaks, as shown in block 122; re-callingunknown bases, as shown in block 124; removing unmatched original bases,as shown in block 126; and inserting new base calls, as shown in block128. In one embodiment, assigning quality values comprises computingtrace parameters, as shown in block 130, and obtaining quality valuesfrom a lookup table, as shown in block 134. Each of the actionscomprising the method are discussed in more detail below.

[0060] In another embodiment, the second step may be thought of as basecalling that includes refining original base calls and inserting newbase calls. In this embodiment, refining base calls may include matchingoriginal base calls with peaks, re-calling unknown bases, and removingunmatched original bases.

[0061] D. Peak Processing

[0062] 1. Peak Detection

[0063]FIG. 4 depicts a flow of actions taken to detect peaks accordingto one embodiment of the method and system of the present invention.Peaks in each of the four traces are detected by identifying allapparent peaks, as shown in block 136, computing apparent peakcharacteristics, as shown in block 138, ignoring insufficiently largepeaks, as shown in block 140, and creating a peak list for each dyecolor, as shown in block 142. The traces corresponding to different dyecolors are processed sequentially, one trace at a time.

[0064] To detect peaks in a given trace, inflection points are found byscanning the trace. An inflection point is the position in anelectropherogram where the second derivative of the intensity of signalreverses sign. If x[i]=i is the position of the i^(th) data point in theelectropherogram, and y[i]q is the intensity of signal at the positionx[i]q, then the second derivative of the signal intensity is computed asy″[i]=y[i+1]−2*y[i]+y[i−1]. A peak is determined to be the concave partsof the electropherogram located between the two inflection points: theleft inflection point where y″[i] turns negative, that is, y″[i−1]≧0 andy″[i]<0; and the right inflection point where y″[i] becomes positiveagain, that is, y″[i−1]≦0 and y″[i]>0. Other portions of theelectropherogram are not included as peaks at this stage of theanalysis.

[0065] After peaks are detected, apparent peak characteristics arecomputed. These characteristics are used temporarily, because they areestimated using the observed or apparent signal, which is actually a sumof signals from individual peaks of the same dye color, and aretherefore not very accurate. The apparent peak characteristics computedinclude width, area, height and spacing. Peak width is the distancebetween the peak beginning at the left inflection point and the peak endat the right inflection point. Peak area is the area below the peak.Peak position is the position in the electropherogram which bisects thepeak area. Peak height is the signal at the peak position. Peak spacingis the distance between the positions of the current peak and theprevious peak of the same dye color.

[0066] While detecting apparent peaks, lists of peaks of each color aresimultaneously created. Each peak is modeled as a structure comprised ofapparent peak characteristics. Each peak list is an array of peakstructures. Small peaks are ignored and are not included in the list. Agiven peak i is considered small if it has an area that is equal to orless than 10% of the average area of 10 preceding non-truncated peaksincluded in the list, or if it has an

a_(i)≦0.1 {overscore (a)}_(i−10 . . . i−1) or a_(i)<0.05 a_(i−1)

[0067] area that is less than 5% of the area of me immediately precedingnon-truncated peak. That is, where a_(i) is the area of the i^(th) peak.

[0068] 2. Expansion of Peaks

[0069] Detected peaks are subsequently expanded to include almost theentire electropherogram with non-zero signal, so that upon the expansionthey look increasingly natural and their type, i.e., single andmultiple, can be determined. In addition, expansion allows for later useof more data in analyzing the peaks. FIG. 5 depicts a sampleelectropherogram showing detected and expanded peaks. Detected peaksPEAK1, PEAK2, PEAK3, and PEAK4 are depicted within solid boundary lines.PEAK1 as initially detected is located between lines 146 and 148, PEAK2as initially detected is located between lines 152 and 154, PEAK3 asinitially detected is located between lines 156 and 158, and PEAK4 asinitially detected is located between lines 160 and 162. After peakexpansion, expanded peak EPEAK1 is located between lines 144 and 151,expanded peak EPEAK2 is located between lines 151 and 155, expandedEPEAK3 is located between lines 155 and 159, and expanded EPEAK4 islocated between lines 159 and 163. After expansion, expanded peaksEPEAK2, EPEAK3 and EPEAK4 share boundaries 155 and 159 (shown as dashedlines).

[0070]FIGS. 6A and 6B depict a flow of actions taken to expand peaks ofthe same dye color according to one embodiment of the method and systemof the present invention. Peaks in the trace of a given dye color areprocessed sequentially, one peak at a time, starting from the leftmostpeak. Referring now to FIG. 6A, the leftmost peak is obtained, as shownin block 202. The processing of each peak consists of three steps. Inthe first step, a trace adjacent to the detected peak is scanned,starting from the left inflection point to the left, toward thebeginning of the trace, as shown in block 204. The scanning continuesuntil any of the beginning of the trace, a zero value of signalintensity, a local minimum of signal intensity or the right inflectionpoint of the previous peak is detected, whichever event occurs first, asshown in blocks 206, 208, 210 and 212. If a zero value of signalintensity is not detected, as shown in block 206, a check is made todetermine if a local minimum of signal intensity is detected, as shownin block 210. If neither of these are detected, scanning continues untilthe right inflection point of the previous peak is detected, as shown inblock 212. The position where the scanning stops is designated as thenew left boundary of the current peak as shown in block 214.

[0071] In the second step, the new left boundary of the current peak iscompared to the new right boundary of the previous peak, as shown inblock 216. Based on this comparison, these boundaries are eitheraccepted, or redefined. After that, both the boundaries are classifiedas one of three types, as shown in block 218. If the current peak is theleftmost peak in the list of peaks of a given color, then its new leftboundary is accepted unconditionally and classified as a Type 1boundary. If the current peak is not the leftmost peak, then the newleft boundary is compared to the new right boundary of the previouspeak. If the new left boundary of the current peak and the new rightboundary of the previous peak coincide and the intensity of signal ateach of these boundaries is less than 10% of the mean height of thecurrent and previous peak, then both the new boundaries are accepted andclassified as Type 1 boundaries. Examples of Type 1 boundaries areboundaries 144, 151, and 163 of FIG. 5. If the new left boundary of thecurrent peak coincides with the new right boundary of the previous peakand the intensity of signal at any of these boundaries is greater thanor equal to 10% of the mean height of the current and previous peak,then both the new boundaries are accepted and classified as Type 2boundaries. Boundary 159 as shown in FIG. 5 is a Type 2 boundary.Finally, if the position of the new left boundary of the current peak islocated to the left from the new right boundary of the previous peak,then both the new boundaries are redefined. The new positions of boththe boundaries are set to the midpoint between the right inflectionpoint of the previous peak and the left inflection point of the currentpeak. Both the boundaries are then classified as Type 3 boundaries.Boundary 155 is a Type 3 boundary, as shown in FIG. 5.

[0072] Referring now to FIG. 6B, in the third step, the trace adjacentto the detected peak is scanned, starting from the right inflectionpoint, toward the end of the trace, as shown in block 220. The scanningcontinues until any of the end of the trace, a zero value of signalintensity, a local minimum of signal intensity or the left inflectionpoint of the next peak is detected, whichever event occurs first, asshown in blocks 222, 224, 226 and 228. If a zero value of signalintensity is not detected, as shown in block 222, a check is made todetermine if a local minimum of signal intensity is detected, as shownin block 226. If neither of these are detected, scanning continues untilthe left inflection point of the previous peak is detected, as shown inblock 228. The position where the scanning stops is designated as thenew right boundary of the current peak, as shown in block 230. A checkis then made to determine whether there are more peaks remaining in thetrace, as shown in block 232. If more peaks remain, the next peak to theright of the current peak is obtained and set as the current peak, asshown in block 240. Analysis then continues at block 204. If there areno further peaks, then the current peak is the rightmost peak in thelist of peaks of a given color, and the new right boundary is acceptedunconditionally and classified as a Type 1 boundary, as shown in block234.

[0073] If after expansion a peak has new left and right boundaries oftypes “i” and “j”, respectively, the peak may be referred to as a peakof Type “ij”. For example, single peaks are peaks of Type “11”. Thefirst and the last peaks in a row of overlapping peaks are of the Types“1i” and “j1”, respectively, where each of i and j is either 2 or 3. Forexample, referring to FIG. 5, EPEAK1 is a peak of Type 11, EPEAK2 is apeak of Type 12, EPEAK3 is a peak of Type 23, and EPEAK4 is a peak ofType 31.

[0074] When a peak's boundaries are expanded, all of the peak's apparentcharacteristics should be re-computed.

[0075] 3. Resolving Multiple Peaks Into Intrinsic Peaks

[0076] As mentioned in the previous section, after expansion,overlapping peaks of the same color may not be separated and may shareboundaries. Moreover, characteristics of overlapping peaks may still beinaccurate. For example, the height of a given expanded peak, defined asthe signal at the peak position, may be inaccurate because it isestimated using the observed signal, which may actually includecontributions from neighboring, overlapping peaks. According to themethod of the present invention, this inaccurate height is regarded asthe apparent height. FIG. 7 depicts a sample electropherogram showingapparent peaks and intrinsic peaks. The apparent electropherogram, 250,is determined to have three peaks with apparent peak heights 252, 254and 256 and apparent peak positions 253, 255 and 257. It is theseapparent peak heights and apparent peak positions that are used by priorart systems. To compute the intrinsic height of a peak, contributionsfrom all the neighboring peaks to the signal at the peak position aresubtracted. Similar adjustments are made to other apparent peakcharacteristics.

[0077] By definition, the intrinsic peak characteristics are measuredfrom the data when no other peak of the same dye color is present inclose vicinity of the current peak, thus eliminating any interferencefrom surrounding peaks which plagues prior art methods. A mathematicalmodel of peak electrophoresis is used to compute the expected shape ofeach peak. The shape is determined as a given function of threeadjustable parameters. These parameters are computed for each peak byfitting a defined part of the electropherogram with the model. Theresulting peak, having the expected shape and located at the expectedposition is called the intrinsic peak. Referring again to FIG. 7, theresult in this example is the determination of three intrinsic peaks,IPEAK1, IPEAK2 and IPEAK3, with intrinsic peak heights 262, 264 and 266and having intrinsic peak positions 263, 265 and 267. According to themethod of the present invention, intrinsic peak characteristics are usedin subsequent processing. This is an improvement over prior art systems,all of which use apparent peak characteristics in base calling and errorprobability calculations.

[0078] a. Mathematical Model of a Peak Shape

[0079] A simple model is used for peak electrophoresis. The modeldescribes the evolution of a peak while the DNA sample moves through anelectrophoretic cell. The model takes into account two physicalprocesses: first, electromigration of DNA fragments with velocity vunder an applied electric field; and, second, Brownian diffusion of themolecules of the fragments with diffusion coefficient D.Electromigration is responsible for the translation of a peak in anapplied field without changing the shape of the peak, while diffusionresults in evolution of the peak shape. A rectangular peak of a finitewidth is used as an initial condition. The shape of the peak observed inan electropherogram is assumed to be the result of evolution in a timeit took the peak to pass through an electrophoretic cell.

[0080] The mathematical model of peak electrophoresis is represented by:$\begin{matrix}{\frac{\partial C}{\partial t} = {{{- v} \cdot \frac{\partial C}{\partial x}} + {D\quad \frac{\partial^{2}C}{\partial x^{2}}}}} & {{Equation}\quad (1)}\end{matrix}$

[0081] This equation is a one-dimensional differential equation of theconvective diffusion type. In this equation, t represents time, xrepresents the position in the electrophoretic cell which is roughlyproportional to the position in the electropherogram, and C is the localconcentration of the considered type of DNA subfragments, which isproportional to the intensity of signal produced by the subfragments.The minus sign before v signifies that the subfragments are assumed tomigrate from the left to the right, that is, in the positive directionof x.

[0082] Equation (1) ignores a number of physical phenomena accompanyingelectrophoresis, such as the effect of the DNA subfragments (which arecharged molecules) on the local conductivity of the medium and theinteraction of the DNA fragments with buffer electrolyte, the lowmolecular weight ions which are also present in the electrophoreticcell. Even so, the approximations, work well when the concentration ofthe DNA fragments is much less than the concentration of the bufferelectrolyte.

[0083] A reasonable initial distribution for the DNA fragments is therectangular profile of height C₀ and width 2w₀ centered at the positionx=0: $\begin{matrix}\begin{matrix}{{t = 0};} & \quad & \quad & {{C\left( {x,0} \right)} = \left\{ \begin{matrix}{0,} & {{x < {- w_{0}}};} \\{C_{0},} & {{{- w_{0}} \leq x \leq w_{0}};} \\0 & {x > {w_{0}.}}\end{matrix} \right.}\end{matrix} & {{Equation}\quad (2)}\end{matrix}$

[0084] A solution to Equations (1) and (2) is given by $\begin{matrix}{{C\left( {x,t} \right)} = {\frac{1}{2}{C_{0} \cdot \left\lfloor {{{Erf}\left( \frac{w_{0} - x + {vt}}{2\sqrt{Dt}} \right)} + {{Erf}\left( \frac{w_{0} + x - {vt}}{2\sqrt{Dt}} \right)}} \right\rfloor}}} & {{Equation}\quad (3)}\end{matrix}$

[0085] where Erf(z) is the error function,${{Erf}(z)} = {\frac{2}{\sqrt{\pi}}{\int_{0}^{z}{^{- z^{2}}{{z}.}}}}$

[0086] This solution, Equation (3), describes the displacement of thepeak at constant velocity v in the direction of positive x accompaniedby evolution of the peak shape as the result of the Brownian diffusionof DNA fragments in the peak. At any time t, the peak shape remainssymmetric. The symmetry of the peak shape is the consequence of theassumptions that the original peak shape at the time t=0 is symmetricand that the interaction of the DNA with each other and with the bufferelectrolyte may be ignored. In practice, the observed analyzed peaks areslightly asymmetric, which may be the result of the asymmetry of aninitial distribution of subfragments, which is not accounted for byEquation (2), or of physical processes ignored by the model, or even theresult of distortion of the data as analyzed by the original signalprocessing system, such as, for example, the ABI Base Caller.Nevertheless, Equation (3) provides a good first approximation.

[0087] b. Fitting A Single Peak

[0088] To fit a single expanded peak, it is assumed that its position,x₀=vt, may be determined from the electropherogram, and the origin ofthe new reference frame is placed at this position. Thus,$\begin{matrix}{{C\left( {z,t} \right)} = {\frac{1}{2}{C_{0} \cdot \left\lbrack {{{Erf}\left( \frac{w_{0} - z}{2\sqrt{Dt}} \right)} + {{Erf}\left( \frac{w_{0} + z}{2\sqrt{Dt}} \right)}} \right\rbrack}}} & {{Equation}\quad (4)}\end{matrix}$

[0089] The coordinate, z, in the new reference frame is defined byz=x−x₀, and the peak shape in the new coordinate system is symmetricwith respect to the origin z=0. Equation (4) depends on threeparameters: C₀, w₀ and β=(Dt)^(1/2). If these parameters, together withthe peak position x₀=vt, are estimated by fitting the data, then theintrinsic peak will be uniquely defined. To estimate the threeparameters C₀, w₀ and β, three measurable characteristics of a peak areneeded. The three characteristics are: peak area, S, peak height, H; andpeak width, W, defined as a width at the height H/2. Once thesecharacteristics are computed from the data, the parameters C₀, w₀ and βfor a given single peak are estimated using the following set of threenonlinear equations: $\begin{matrix}{H = {{C_{0} \cdot {Erf}}\quad \frac{w_{0}}{\beta}}} & {{Equation}\quad (5)}\end{matrix}$

 S=2w₀·C₀  Equation (6) $\begin{matrix}{{{{Erf}\left( \frac{w_{0} - {W/2}}{\beta} \right)} + {{Erf}\left( \frac{w_{0} + {W/2}}{\beta} \right)}} = {{Erf}\left( \frac{w_{0}}{\beta} \right)}} & {{Equation}\quad (7)}\end{matrix}$

[0090] Although single peaks fit relatively accurately, the fit is notperfect. The method of the present invention uses best-fit parametersC₀, w₀ and β, to compute a peak resolution. Peak resolution is a measureof deviation of the model from the data. Peak resolution is later usedfor both base calling and computing quality values. The peak resolutionis computed by determining the curve which describes the absolutedifference between the model signal, given by Equation (4), and thedata, i.e., the observed signal in the trace. This absolute differencecurve is a function of the position in the electropherogram and isdefined between the apparent beginning and apparent end of the expandedapparent peak. The peak resolution is computed as the ratio of the areaunder the absolute difference curve to the area of the expanded apparentpeak. Thus, the peak resolution is always non-negative and equals zeroif and only if the model ideally fits the data. Normally, the value ofpeak resolution is less than unity. For peaks that are fit well, peakresolution is typically approximately 0.1 or less.

[0091] c. Resolving Multiple Overlapping Peaks

[0092] Because the model shape of each peak depends on three adjustableparameters, to fit a group of n overlapping peaks, 3n independentparameters describing the shapes of intrinsic peaks, plus n parametersdescribing their positions are adjusted simultaneously. Experienceindicates that, in some cases, n may be as high as 10 at the end oftrace. The computational complexity of this problem may be reduced byexploiting specific properties of the physical model. Two of the threeparameters of the peak shape are local invariants. That is, two of thethree parameters are assumed to share the same values as neighboringpeaks in the trace. These invariant parameters are: w₀ and β.

[0093] Parameter w₀ is the half-width of the original volume where allthe DNA fragments were placed before the start of the electrophoresisexperiment. Parameter β=(D t)^(1/2) is also expected to remain the samefor neighboring peaks. This is because such peaks spend about the same tin the electrophoretic cell, and because the diffusion coefficient D ofthe subfragments located in these peaks are expected to be about thesame. Since w₀ and β are the local invariants, it follows from Equations(5), (6) and (7) that the measurable peak characteristics S/H and Warealso local invariants. To fit n overlapping peaks of the same color,these invariants are used in a sequence of computations.

[0094] In one embodiment, the average values of S/H and W are computedamong five good immediately preceding apparent peaks. The peak isregarded as good if it is Type 11, 12 or 21 (according to the peak typesdiscussed above), and it has a peak resolution of 0.2 or less. If thepeak type is 12, the left half of the peak is used to compute S and W,and the result is doubled. If the peak type is 21, the right half of thepeak is used to compute S and W, and the result is doubled.

[0095] From these average values, the shape parameters w₀ and β arecomputed using Equations (5), (6) and (7) and assigned to all theoverlapping peaks in the group. For each peak in the overlapping group,the rest of the peak parameters, the initial height C₀ and the intrinsicpeak position, ipos, are computed via an iterative procedure. First, azero-order approximation for these parameters is chosen. In thiscomputation, the zero order approximation for C₀ is referred to as C₀⁽⁰⁾ and the zero order intrinsic position is referred to as ipos⁽⁰⁾. C₀⁽⁰⁾ is computed as C₀ ⁽⁰⁾=H/Erf(w₀/β), where H is the apparent height ofthe expanded peak, and ipos⁽⁰⁾ is set to the apparent position of thepeak. The next order approximation for parameters C₀ and ipos of eachpeak in the group is then computed by processing all the peaks of thegroup sequentially in one pass from the left to the right. To computethe next order approximation for parameters C₀ and ipos of a given peak,the intrinsic signals of all other peaks are subtracted from theoriginal data signal corresponding to the n peaks. The resulting signalis used to determine the next order approximation for C₀ and ipos.Specifically, ipos⁽¹⁾ is set to the position of maximum of the resultingsignal and C₀ ⁽¹⁾ is set to the ratio of the maximum value of theresulting signal to Erf(w₀/β). In one embodiment, this next ordercomputation is repeated four times.

[0096] E. Refining Base Calls

[0097] After peaks have been processed such that intrinsic peakcharacteristics have been determined, original base calls are refined.Base call refining comprises determining which of the peaks correspondto real bases of the DNA fragment, sorting out noise peaks, andinferring the DNA sequence.

[0098]FIG. 8 depicts a flow of actions taken to refine base callsaccording to one embodiment of the method and system of the presentinvention. In one embodiment, data from a base calling system, such asthe ABI Base Caller, are used to obtain a trace and make a first try atcalling bases. These called bases, the original base calls, are obtainedfrom a sample file along with other data as an input, as shown in block310. These original base calls are regarded as a first-orderapproximation to the true DNA sequence. The original base calls are thenedited or refined. This refining may also be referred to as re-callingas the originally called bases are called again for a second time. Foreach original base call, the corresponding peak in the electropherogramis searched at the location in the trace corresponding to the originalbase call, as shown in block 312. If the peak is found, and the peak isa the true peak, the peak's base is called, as shown in block 314. Whenappropriate, wide peaks are resolved into narrower peaks, as shown inblock 315. If a peak is determined to be noise or is not found, it iscalled as N or unknown, as shown in block 316. All base calls labeled asunknown or N are then analyzed, and a best guess is used to replace Nsby one of ‘A’, ‘C.’, ‘G’ or ‘T’. Unknown bases are either re-called, asshown in block 318, or rejected. If no appropriate peak is found at ornear the unknown base location, those unknown bases for which no goodpeak is found are removed, as shown in block 320. Newly added base callsmay then be inserted into the original base call sequence, as shown inblock 322.

[0099] 1. Refining Original Base Calls

[0100] The sequence of bases called by the original base calling system,including unknown or N bases, is refined. A peak corresponding to agiven base is determined using the original base call location from thesample file together with other data.

[0101] a. Known Original Bases

[0102]FIG. 9 depicts a flow of actions taken to refine originally calledand known bases according to one embodiment of the method and system ofthe present invention. Original known base calls are obtained, as shownin blocks 330. Then, the peak list for one of the four particular colorsrepresenting A, C, G and T is scanned sequentially, starting from thepeak having the smallest location index, as shown in block 332. For eachoriginal base that is known, that is, it was originally designated as A,C, G or T, the list of peaks of the color corresponding to the base issearched for a peak such that the original base location falls into thepeak area, as shown in block 334. That is, a check is made to determinewhether a peak in which the original base location falls between thebeginning and end of the peak is found.

[0103] If a peak is found, as shown in block 334, a check is made todetermine whether the peak has already been assigned, as shown in block336. That is, a check is made to determine if any of the previouslyconsidered bases had a location falling into the peak's area. If thepeak has not already been assigned, as shown in block 336, then the peakis called and the base is assigned to it, as shown in block 352. Morespecifically, the base is accepted as called, the peak is marked called,and the base location is reset to the intrinsic position of the calledpeak.

[0104] If no peak is found at the location of the original base, asshown in block 334, then the base is set as N so that it will bereconsidered in later processing, as shown in block 350.

[0105] If a peak corresponding to the current original base is found, asshown in block 334, referred to as the second base, and the peak wasalready called and assigned to another base, as shown in block 336,referred to as the first base, that is, one of the previously refinedbases had an original location falling into the peak's area, then adetermination is made whether the peak is wide enough to be split intotwo good peaks, as shown in block 338. If the peak is truncated, it willbe considered wide enough to split. A peak is considered truncated whenthe data signal of the peak reaches the maximum value of the signaldetected for the trace. If the peak is not truncated, then to determinewhether a peak is wide enough to split, the ratio of the apparent peakarea (computed earlier when expanding peaks) to the intrinsic peakheight is computed. This ratio is compared to the similar ratio averagedamong five immediately preceding good peaks. If the computed ratio isgreater than the average ratio multiplied by the factor 1.125, then thepeak is wide enough to split. The factor 1.125 has been determinedempirically through optimization of the base calling accuracy. If a peakis wide enough, it is resolved into two peaks, as shown in block 346,and the two peaks are called and the two bases are reassigned, as shownin block 348. That is, each of the two bases, the first base and thesecond base, is assigned to one of the two newly resolved peaks, theoriginal peak is removed from the list of peaks of this trace, and thetwo new peaks are inserted into this list.

[0106] If the peak is not wide enough to split, as shown in block 338,that is, if the computed ratio is not greater than the average ratiomultiplied by this factor, the peak will not be resolved into two peaks.One of the two bases, the first base and the second base, is assigned tothis peak, as shown in block 340. Specifically, the two original baselocations are compared to the peak position. Whichever base is locatedcloser to the peak position is assigned to the peak. If the two originalbase locations are equidistant from the peak position, then the one thatis located where the signal is stronger is selected. As to the otherbase, the location of the previous and next base in the tracecorresponding to the current base is searched in an attempt to find agood peak. That is a check is made to determine whether there is a goodpeak near the other found but, as yet, unassigned base, as shown inblock 342. If no good peaks are found near the other base, the base isset as unknown or N so that it is considered in later analysis, as shownin block 350. If a good peak is found near the other base, the peak iscalled and the base is assigned to it, as shown in block 352.

[0107] A check is then made to determine whether there are more knownoriginal base calls, as shown in block 354. If there are more, then thenext original base call is obtained, as shown in block 356, andexecution continues at block 332. If there are no further original basecalls, then this portion of the processing ceases, as shown in block358.

[0108] b. Unknown Bases

[0109]FIG. 10 depicts a flow of actions taken to refine base callsdesignated as unknown or N according to one embodiment of the method andsystem of the present invention. Those bases designated as unknown or N,both originally and during refining of original base calls, areprocessed. More specifically, the bases designated as unknown or N whenrefining the original base calls as well as the bases that wereoriginally located but designated as unknown or N by the original basecalling system are analyzed in an attempt to refine or re-call them to abest guess. The unknown or N base calls are obtained, as shown in block360. The peak lists containing data for all four traces A, C, G and Tare scanned, as shown in block 362. A check is made to determine whetherthere are is an uncalled peak at the location of the unknown base, asshown in block 364. If there are no peaks at the base location, a checkis then made to determine whether there are any peaks in any of the fourtraces near the unknown base location, as shown in block 374. If thereare no good peaks near the unknown base, the unknown base call isrejected, and it is removed from the list of called bases, as shown inblock 376.

[0110] If there are one or more peaks found at the base location, thebest peak and its trace are selected at the base location, as shown inblock 366. This selection is made according to the following rules:

[0111] a non-truncated peak is preferred over a truncated peak;

[0112] among non-truncated peaks, only those peaks having a peakresolution parameter of 0.2 or less are considered;

[0113] between two non-truncated peaks, their intrinsic height iscompared and the peak with the greater height is selected; and

[0114] among two truncated peaks, the wider peak is selected.

[0115] After a best peak is found, a check is made to determine if thispeak was already assigned, that is, was called previously, as shown inblock 368. If the peak was not previously called, the peak is called,and the base is assigned to it, as shown in block 380. Morespecifically, the peak is marked called and the unknown designator or Nis replaced with the base corresponding to the best peak.

[0116] If the peak was already assigned to another base, a check is madeto determine if the peak is wide enough to split, as shown in block 370.This analysis is the same as that discussed above regarding block 338.If the peak is wide enough to split, the peak is resolved into twopeaks, as shown in block 378, and then the peak is called and the baseis assigned to it, as shown in block 380. More specifically, theoriginal peak is removed from the peak list corresponding to the trace,each of the two new peaks are marked called, each of the new peaks areinserted into the peak list, the current base and the previously calledbase are assigned to one of the descendant peaks, and the unknowndesignator or N is replaced with the base corresponding to the assignedpeak. In one embodiment, if the best peak is too narrow, the base may beleft as is and may be processed in a subsequent step. If the peak is nowide enough to be split, the peak is assigned to one of the two bases,as shown in block 372. Processing then continues at block 374 asdiscussed above.

[0117] If a peak is not found at the unknown base location, but a goodpeak is found near the unknown peak, then the peak is called and thebase is assigned to it, as shown in block 380. More specifically, thepeak is marked called, the unknown designator or N is replaced with thebase corresponding to the trace of the near peak, and this base isassigned to the peak. A check is then made to determine if there are anyfurther unknown or N base calls to be refined, as shown in block 382. Ifthere are more, the next unknown or N base call is obtained, andprocessing continues at block 362. If thee are no further unknown or Nbase calls to process, the processing ceases at block 386.

[0118] 2. Inserting New Base Calls

[0119] All the peaks, both called and uncalled for all colors are thenreviewed, and some previously uncalled peaks are called.

[0120] a. Creating A Multi-Colored Ordered List of all Peaks

[0121] Before reviewing all the peaks, a single mixed list of peaks ofall colors is created from the four uni-colored peak lists that existedup to this point. To create the single list of peaks, each of the fouruni-colored peak lists is scanned simultaneously starting from peakswith lower peak positions. Peaks are selected for the single listaccording to the following rules:

[0122] peaks having lower intrinsic positions are selected before otherpeaks;

[0123] if two or more peaks from different traces are found which havethe same positions, called peaks are placed before uncalled peaks;

[0124] if two or more peaks from different traces have the samepositions and all are called, the peak having a lower index in itscorresponding uni-colored peak list is placed first;

[0125] if two or more peaks from different traces have the samepositions and all are uncalled, the peak with the greater intrinsic peakheight is placed first;

[0126] if two or more peaks from different traces have the samepositions, and all are uncalled, and all have the same intrinsic peakheight, then the peak with greater area is placed first.

[0127] b. Checking the Peak Calling Criteria and Calling Good Peaks

[0128]FIG. 11 depicts a flow of actions taken to call uncalled peaksaccording to one embodiment of the method and system of the presentinvention. In this, the last step in the base calling analysis, thesingle multi-colored peak list is scanned, as shown in block 420, in aneffort to locate good but uncalled peaks. Each successful peak candidatewill satisfy a multitude of requirements. By definition, the peak mustnot have been previously called in an earlier portion of the analysis.The peak must not be not truncated. When the first uncalled peak isobtained, as shown in block 422, at least three analyses are performed.Each of the analyses is made in succession; if any of them fail, nofurther checks are made, and the peak is not called.

[0129] First, a check is made to determine whether the base index, whichis the index of the base which would be inserted into the called basesequence if the peak would be called, is greater than a system specifiedrecommended minimum. In one embodiment, this minimum is set to 600. Acheck is then made to determine whether the intrinsic height of theuncalled peak is greater than the intrinsic signal of any other peak atthe location corresponding to the candidate peak's position, as shown inblock 426. Last, a check is made to determine whether the spacingbetween neighboring called base locations is large enough for insertionof a new base, as shown in block 428. That is, a check is made todetermine if the spacing between neighboring called base locations issufficiently larger than the average spacing of preceding bases. Morespecifically, the distance in the electropherogram between the positionsof the two called peaks, the one immediately preceding and the otherimmediately following the current peak candidate, must be no less thanthe average peak spacing among the 10 immediately preceding called peaksmultiplied by the peak insertion factor, PIF, which is computed as:Last, a check is made to determine whether the spacing betweenneighboring called base locations is large enough for insertion of a newbase, as shown in block 428. That is, a check is made to determine ifthe spacing between neighboring called base locations is sufficientlylarger than the average spacing of preceding bases. More specifically,the distance in the electropherogram between the positions of the twocalled peaks, the one immediately preceding and the other immediatelyfollowing the current peak candidate, must be no less than the averagepeak spacing among the 10 immediately preceding called peaks multipliedby the peak insertion factor, PIF, which is computed as:${PIF} = \left\{ \begin{matrix}{\infty,} & {{base\_ index} \leq 600} \\{{2 - {1.1 \cdot {\left( {{base\_ index} - 600} \right)/200}}},} & {600 < {base\_ index} \leq 800} \\0.9 & {{base\_ index} > 800}\end{matrix} \right.$

[0130] When peaks meeting each of these criteria are found, the peaksare called and corresponding new bases are inserted into the called basesequence, as shown in block 430. A check is then made to determine ifthere are any further uncalled peaks, as shown in block 432. If thereare more uncalled peaks, the next uncalled peak is obtained, as shown inblock 434, and processing continues at block 424. If thee are no furtheruncalled peaks, the processing ceases at block 436.

[0131] F. Quality Value Assignment

[0132] Quality values are then assigned to the called bases. Traceparameters and a lookup table are used to compute quality values. Thequality value, QV, for a given called base position is defined as:

QV=−10·logP

[0133] rounded to the nearest integer. Here, P is the probability oferror in base calling at this position. To be able to successfullycompute the QVs, the system trains on data for which the correct basesequence is already known. By comparing this correct sequence with whatthe system outputs, the base positions at which the system makes basecalling errors are identified. These base positions are stored bytraining software together with the “conditions” in the trace(represented by trace parameters) under which the error was made. Usingthese data and a dynamic programming code, which is also a part of theraining software, values of the four trace parameters are then mapped toa quality value. This map is saved in a file called the “lookup table”.The lookup table is used when assigning quality values. To determine thequality value for a given base, the four trace parameters for the baseare computed from the electropherogram. The lookup table is thenconsulted.

[0134] 1. Computing Trace Parameters

[0135] The system involves four trace parameters. All of these traceparameters involve intrinsic peak characteristics. Three of the traceparameters, the two peak height ratios and the peak spacing ratio, aresimilar to those used in Phred, but in the method of the presentinvention these trace parameters are derived from intrinsic, rather thanapparent peak heights and positions. The fourth trace parameter is peakresolution. This parameter was defined above. Peak resolution as definedand used herein has nothing in common, except its name, with Phred'speak resolution pararneter.

[0136] Two peak height ratios, phr3 and phr7, are used as traceparameters. To compute phr3, a window of three called peaks in theelectropherogram, centered at the current called peak is selected. Theratio of the intrinsic height of the largest uncalled peak in the windowand the lowest called peak defines phr3. If there are no uncalled peaks,the largest of the three uncalled trace array values at the location ofthe called base is used instead. Trace parameter phr7 is computedsimilarly, but with a window of seven called peaks. To compute the peakspacing ratio, psr7, a window of seven called peaks in theelectropherogram, centered at the current called peak, is selected. Theratio of the largest peak-to-peak spacing in the window to the smallestpeak-to-peak spacing is then calculated. Only called peaks areconsidered when computing psr7 and the spacing is defined as thedistance between the positions of adjacent called peaks. The definitionof the peak resolution parameter, pres, is as set forth above. In oneembodiment, pres is a double precision value.

[0137] 2. Obtaining Quality Values

[0138] After the four trace parameters phr3₀, phr₀, psr7₀ and pres₀ fora given called base calculated, the lookup table is consulted toestimate the quality value of the base. In one embodiment, the lookuptable consists of five columns. The first column contains (integer)quality values. Each of the four other columns contains double precisionthreshold values for the trace parameters phr3_(T), phr7_(T), psr7_(T)and pres_(T).

[0139] The lookup table is scanned line by line, starting from the firstline, until a line satisfying the following conditions is found:

phr3₀≦phr3_(T); phr7₀≦phr7_(T); psr7₀≦psr7_(T); pres₀≦pres_(T).

[0140] The quality value from the first column of this line is assignedto the base. Upon finding such a line, scanning of the lookup table isterminated.

[0141] G. Training the System

[0142] To successfully implement and use the system and method of thepresent invention, the system must be calibrated by training the system.The training procedure consists of two major steps: first, a trainingdata file is generated by extracting and storing information from samplefiles and a consensus sequence; and, second, a lookup table is generatedusing the extracted information. As such, training receives as input aset of sample files also known as a dataset, and the correct targetsequence, also known as the consensus sequence. The dataset may consistof from a few hundred to a few thousand sample files. The consensussequence may be over 100,000 bases long. Instead of a single consensussequence, a plurality of consensus sequences, each matching a givensubset of fragments, may be passed as an input to the training software.

[0143] At the extraction and storing step, the method begins byprocessing electropherograms for each fragment, determining the calledsequence of the fragments' bases, and computing trace parameters foreach called base. The method then determines which part of each fragmentis actually relevant to the consensus sequence. This is achieved by alocal alignment of the called fragment with the consensus sequence. Onlythose bases corresponding to the relevant part of the fragment are usedin the training process. The computed result for each “good” base isthen stored in an ASCII data file, which may be referred to as atraining file. The training file contains the information about thevalues of four trace parameters for each good base, the location indexof the base in the fragment and consensus sequence, and whether the basecall was correct.

[0144] At the second step, a lookup table is generated. According to oneembodiment of the method, information is read from the training file. Abinning procedure for each of the four trace parameters is thenperformed in which the good bases are partitioned into a specifiednumber of subsets which may be referred to as bins. Each bin containsonly the bases for which the given trace parameter lies within certainlimits, called thresholds. The values of the thresholds are selected andcomputed so that each bin contains roughly the same number of bases. Thetotal number of thresholds for a given trace parameter is, therefore,equal to the specified number of bins plus one. A lookup table is thengenerated. The lookup table comprises a plurality of lines. Each linecontains four parameter thresholds, together with a quality valuecorresponding to those thresholds. Multiple lines may have the samequality value.

[0145] 1. Generating A Training File

[0146] To generate a training file, peaks are processed and bases arecalled as discussed above. After bases in the fragment are called, thenext step is to determine which part of the fragment is actuallyrelevant to the consensus. This is done by aligning the called fragmentwith a corresponding consensus sequence and finding the region of theirlocal alignment. To this end, a FASTA algorithm (W. R. Pearson and D. J.Lipman D. J., “Improved Tools For Biological Sequence Analysis”,Proceedings of the National Academy of Sciences of he USA, vol. 85,pp.2444-2448 (1988)) and a linear Smith-Waterman (SW) algorithm (T. F.Smith and M. S. Waterman, “Identification Of Common MolecularSubsequences”, Journal of Mathematical Biology, vol.147, pp.195-197(1981)) may be implemented. The FASTA algorithm is used for a quickpreliminary search of the best local alignment region, whereas the SWalgorithm does the actual alignment. In one embodiment, the defaultparameters of the SW algorithm are: match premium=+10; mismatchpenalty=−20; and insertion/deletion penalty=−16.

[0147] For better results, in one embodiment, a vector part of thefragment should be masked before doing the alignment with consensus. Avector is a part of genome of the bacteria used as a host to multiplythe DNA fragment before performing sequencing. After thismultiplication, the DNA fragment is separated from the vector using arestriction enzyme. The restriction enzyme cleaves the DNA at specificsites called the restriction sites. The restriction sites are located onthe vector, so some portion of the vector sequence still remains as apart of a fragment. To mask the vector part of the fragment, thefragment sequence is aligned against the known vector sequence.

[0148] In one embodiment, the results of alignment of the fragmentsequence with the consensus sequence are then output into the trainingfile in nine columns, one line per base. The first column may containthe position of the base in the fragment sequence. The second column maycontain the fragment's base character at the position of the base in thefragment sequence such that ‘-’ may be output in this column torepresent a gap such as when consensus has a base at this position andthe fragment does not. The third column may contain the position of thebase in the consensus sequence. The fourth column may contain basecharacter of the consensus at position of the base in the fragmentsequence such that ‘-’ may be output for the insertion when the fragmenthas a base at this position and the consensus does not. The fifth columnmay contain the number “1” if the base call is correct based the basecharacter in the fragment matching the base character in the consensus,may contain the number “0” if the base call is incorrect, or may containnothing, a blank space if there is no base call at the current position.The sixth, seventh, eighth and ninth column contain the values of thetrace parameters phr3, phr7, psr7 and pres, respectively, if thecharacter in the second column does not represent a gap. If thecharacter in the second column represents a gap, then the sixth,seventh, eighth and ninth columns all contain empty space.

[0149] As some of the bases may not represent the best alignment, theymay be excluded from the training file. More specifically, if thefraction of errors within the best-alignment region exceeds 10%, thematch between the fragment and consensus sequence may be regarded asoccasional, and therefore, the whole alignment may be rejected and notstored in the training file. In addition, if two or more distinctregions in the consensus sequence are found which match the fragmentwithin a specified threshold, namely, those with a fraction of errors≦85% within the best alignment region, then the consensus sequence maybe regarded as having repeats, and all of the alignments may be rejectedsuch that the sample file may be ignored.

[0150] 2. Generating A Lookup Table

[0151] a. Generating an Ordered List of Unique Values of TraceParameters

[0152] The values of each trace parameter are read from the trainingfile and are then sorted by value. If two or more identical values ofthe same trace parameter are found, they will be “merged” into a single“unique” value which will be assigned a weight equal to the number ofidentical values for which this parameter is found. At the conclusion ofthis procedure, four couples of arrays are created. Each couple consistsof an ordered double type array of unique values {U_(i)} of the traceparameter and integer array of their weights {w_(i)}, both the arrayshaving the same length K: U₁, U₂, U₃, . . . U_(K) and w₁, w₂, w₃, . . .w_(K).

[0153] b. Generating A List Of Threshold Values

[0154] A list of trace parameter threshold values is generated bypartitioning values of a given trace parameter into a number of groups,which are referred to as bins. Bins have the following characteristics:first, different bins contain approximately the same number of parametervalues; and, second, bins are ordered such that each parameter value ina given bin is greater than any parameter value in the previous bin andless than any parameter value in the next bin. The thresholds aredefined as the boundaries between neighboring bins. These twocharacteristics can only be satisfied if all the values of a given traceparameter are distinct. In this situation, the expected occupancy of abin is equal to the ratio of the total number of good bases used fortraining to the desired number of thresholds, M, which is specified by auser of the training method. In one embodiment, it has been foundbeneficial to set M to equal 50. In practice, however, trace parameterscomputed for different bases may have identical values. As mentionedabove, these values may be merged into unique values, and each uniquevalue may be assigned a weight. If the weight of a given unique valueexceeds the expected occupancy of a bin, then it should occupy aparticular bin alone. In other words, a unique value can not be “split”between two or more bins, otherwise, the second characteristic will notbe met.

[0155] In one embodiment, the thresholds of the trace parameters arecomputed. The trace parameters, represented by weighted unique numbers,are then partitioned into bins according to the following method. First,the expected occupancy of the first bin is computed as S₁=N/M, where Srefers to occupancy, N is the total number of bases stored in thetraining file, and M is the specified number of bins or thresholds to beused. The value of the last threshold is equal to the largest uniquevalue of trace parameter. Second the smallest unique value is assignedto the first bin. Third, the expected occupancy of the first bin iscompared to the weight of the first unique value. That is, if w₁>S₁,then the next unique value will be assigned to the next bin, the valueof the first threshold, thr₁, will be determined as thr₁=(U₁+U₂)/2, andthe expected occupancy of the second bin will be determined as:S₂=(N−w₁)/(M−1). However, if w₁<S₁, another comparison is made. Ifw₁+w₂>S₁, the two numbers, w₁ and w₁+w₂ are compared. If the firstnumber w₁ is closer to the expected occupancy of the first bin, S₁, thenthe first bin is processed as described above. If the second number, thesum w₁+w₂, is closer, then both the first and second unique numbers areassigned to the first bin, the third unique number is assigned to thesecond bin. The first threshold is determined as thr₁=(U₂+U₃)/2, and theexpected occupancy of the second bin is calculated asS₂=(N−w₁−w₂)/(M−1). Otherwise, the list of unique numbers is scanneduntil a number U_(i) is found such that

w₁+w₂+ . . . +W_(i−)1<S₁

[0156] and

w₁+w₂+ . . . +w_(i)>S₁.

[0157] Then, if

|w₁+w₂+ . . . +w_(i−1)−S₁|<w₁+w₂+ . . . +w_(i)−S₁|,

[0158] the first bin is populated with W₁, w₂, . . . w_(i−1) and set

S ₁═(N−w ₁ −w ₂ − . . .−w _(i−1))/(M−1), thr ₁=(U _(i−1) +U _(i)/2.

[0159] Otherwise, the first bin is populated with w₁, w₂, . . . w_(i)and set

S ₁=(N−w ₁ −w ₂ − . . . −w _(i−1) −w _(i))/(M−1), thr ₁=(U _(i) +U_(i+1))/2.

[0160] Fourth, all subsequent bins are populated as described in thethird step,

[0161] c. Generating The Table

[0162] After the bins are populated, lookup table entries are generated.First, a cut (i, j, k, l) is defined to be a rectangle in a fourdimensional space of trace parameters, which includes all the bases withtrace parameters in the range phr3≦phr3_thr_(i), phr7≦phr7_thr_(j),psr7≦psr7_thr_(k) and pres≦pres_thr_(l), where i, j, k and l rangebetween 1 and M, and M is the number of thresholds for each of fourtrace parameters. There are M⁴ cuts in all. A list of all consideredcuts is created.

[0163] Let err_(i, j, k, l) and corr_(i, j, k, l) be the total number oferroneous and correct base calls in the cut (i, j, k, l). Then, wedefine the error rate for the cut by:$e_{i,j,k,l} = {\frac{\delta_{i,j,k,l} + {err}_{i,j,k,l}}{\delta_{i,j,k,l} + {err}_{i,j,k,l} + {corr}_{i,j,k,l}}.}$

[0164] The corresponding quality value is defined by:

QV_(i,j,k,l)=−10·log₁₀ e _(i,j,k,l).

[0165] Here, $\delta_{i,j,k,l} = \left\{ \begin{matrix}{0,} & {{{err}_{i,j,{kj},l} > 0};} \\{1,} & {{err}_{i,j,k,l} = 0.}\end{matrix} \right.$

[0166] is a small-sample correction that may be added to insure thatboth the numerator and denominator are positive even whenerr_(i, j, k, l)=0.

[0167] The following two steps are then iterated to create the lookuptable. First, the cut (i, j, k, l) for which QV_(i, j, k, l) is largestis found. If two or more cuts with the largest QV are found, the cutwhich contains the larger total number of bases,err_(i, j, k, l)+corr_(i, j, k, l) is selected. If there is more thanone of these, the cut for which the sum of the indexes, i+j+k+l, ishighest is selected. The valueQV_(i, j, k, l and threshold values phr)3_thr_(i), phr7_thr_(j),psr7_thr_(k) and pres_thr_(l) are then output. The cut (i, j, k, l) isthen deleted from the list of considered cuts. Second, for eachremaining cut (i′, j′, k′, l′), the counts err_(i′, j′, k′, l′) andcorr_(i′, j′, k′, l′) are adjusted by deleting the bases which it sharedwith cut (i, j, k, l). The values e_(i′, j′, k′, l′) andQV_(i′, j′, k′, l′) are then recomputed using the new values. Iferr_(i′, j′, k′, l′) and corr_(i′, j′, k′, l′) are 0 for all remainingcuts, the generation ends. Otherwise, generation continues with thefirst step.

[0168] For a typical value of M=50, there are M⁴=6.25 million cuts atall, so the procedure of adjusting the base counts in each cut after thecut (i, j, k, l) with largest QV_(i, j, k, l) is found may becomputationally slow. To accelerate this procedure, a 4-dimensionaldynamic programming algorithm is implemented, whereby the base countsfor the current cut are computed using the base counts of a few previouscuts in the list of all cuts. The number of correct base callscorr_(i, j, k, l) and the number of erroneous base callserr_(i, j, k, l) are determined for all values of i, j, k, and l asfollows. Let c(i,j,k,l) be the number of correct base calls in the cut(i,j,k,l) and let b(i,j,k,l) be the number of correct base calls in thebin (i,j,k,l), where i, j, k and l range between 1 and M, and M is thenumber of thresholds for each of four trace parameters. To account forboundary conditions, let the value of c(i,j,k,l) be 0 if i=0 or j=0 ork=0 or l=0. The recursive relationship used by the dynamic programmingalgorithm is $\begin{matrix}{{c\left( {i,j,k,l} \right)} = {{b\left( {i,j,k,l} \right)} + {c\left( {{i - 1},j,k,l} \right)} + {c\left( {i,{j - 1},k,l} \right)} -}} \\{{{c\left( {{i - 1},{j - 1},k,l} \right)} + {c\left( {i,j,{k - 1},l} \right)} -}} \\{{{c\left( {{i - 1},j,{k - 1},l} \right)} - {c\left( {i,{j - 1},{k - 1},l} \right)} +}} \\{{{c\left( {{i - 1},{j - 1},{k - 1},l} \right)} + {c\left( {i,j,k,{l - 1}} \right)} -}} \\{{{c\left( {{i - 1},j,k,{l - 1}} \right)} - {c\left( {i,{j - 1},k,{l - 1}} \right)} +}} \\{{{c\left( {{i - 1},{j - 1},k,{l - 1}} \right)} - {c\left( {i,j,{k - 1},{l - 1}} \right)} +}} \\{{{c\left( {{i - 1},j,{k - 1},{l - 1}} \right)} + {c\left( {i,{j - 1},{k - 1},{l - 1}} \right)} -}} \\{{{c\left( {{i - 1},{j - 1},{k - 1},{l - 1}} \right)}.}}\end{matrix}$

[0169] The same recursive relationship applies when we alternatively letc(i,j,k,l) be the number of erroneous base calls in the cut (i,j,k,l)and let b(i,j,k,l) be the number of erroneous base calls in the bin(i,j,k,l), where i, j, k and l range between 1 and M.

[0170] The use of the dynamic programming algorithm for the rapidgeneration of a site specific lookup table is a significant improvementof the method of present invention over prior art systems that maintaina fixed lookup table. This algorithm allows complete training of thesystem on a typical dataset comprised of a few thousands sample files injust a few hours. That means customized calibrations of the system onthe data generated by any particular user at any particular site can bedriven quickly and used instead of a single generic calibration (i.e.,lookup table) provided by prior art systems such as Phred. Because theelectropherograms produced by DNA sequencers on different sitesgenerally differ in characteristic peak heights, shapes and/or spacing,the trace parameters computed from these data will also differ from onesite to another. Thus, more reliable predictions of quality values forbases sequenced at a given site can be made if a lookup table generatedfrom the data produced at the site is used.

What is claimed is:
 1. A method comprising: processing a plurality ofinformation obtained from a base calling system; creating a plurality ofrefined base calls using a plurality of original base calls and aplurality of intrinsic peak characteristics; and assigning a qualityvalue to each of the plurality of refined base calls using the pluralityof intrinsic peak characteristics.
 2. The method of claim 1 whereinprocessing comprises: detecting a plurality of peaks; expanding theplurality of peaks into a plurality of expanded peaks; and resolving theplurality of expanded peaks.
 3. The method of claim 2 wherein detectingcomprises: identifying a plurality of inflection points; and selectingthe plurality of peaks based on the plurality of inflection points. 4.The method of claim 3 wherein detecting further comprises: computing aplurality of apparent peak characteristics.
 5. The method of claim 3wherein selecting comprises: ignoring those peaks that areinsufficiently large.
 6. The method of claim 2 wherein expandingcomprises: scanning to the left from a left inflection point of eachpeak of the plurality of peaks to determine an expanded left boundaryfor the peak; and scanning to the right from a right inflection point ofeach peak of the plurality of peaks to determine an expanded rightboundary for the peak.
 7. The method of claim 6 wherein scanning to theleft comprises: if a zero signal is detected, designating the locationat which the zero signal is detected as left expanded boundary for thepeak; if the beginning of a trace is detected, designating the locationat which the beginning of a trace is detected as the left expandedboundary for the peak; if a local minimum is detected, designating thelocation at which the local minimum is detected as the left expandedboundary for the peak; if a right inflection point of a previous peak isdetected, designating the location at which the right inflection pointof the previous peak is detected as the left expanded boundary for thepeak; if the left expanded boundary is to the left of a right expandedboundary of the previous peak, redefining the left expanded boundary ofthe peak and right expanded boundary of the previous peak as themidpoint between a left inflection point of the peak and the rightinflection point of the previous peak.
 8. The method of claim 6 whereinscanning to the right comprises: if a zero signal is detected,designating the location at which the zero signal is detected as theright expanded boundary for the peak; if the end of a trace is detected,designating the location at which the end of the trace is detected asthe right expanded boundary for the peak; if a local minimum isdetected, designating the location at which the local minimum isdetected as the right expanded boundary for the peak; if a leftinflection point of a next peak is detected, designating the location atwhich the left inflection point of the next peak is detected as theright expanded boundary for the peak; if the right expanded boundary isto the right of a left expanded boundary of the next peak, redefiningthe right expanded boundary of the peak and left expanded boundary ofthe next peak as the midpoint between a right inflection point of thepeak and the left inflection point of the next peak.
 9. The method ofclaim 6 wherein expanding further comprises: classifying each of theexpanded peaks based on the type of expanded left boundary and the typeof expanded right boundary.
 10. The method of claim 2 wherein resolvingcomprises: fitting the plurality of expanded peaks using a model of apeak shape; and computing a peak resolution.
 11. The method of claim 10wherein fitting comprises: deriving an equation representing an outputof an electrophoresis of a plurality of DNA fragments, the equationaccounting for electromigration and diffusion; solving the equationusing a rectangular peak of finite width as an initial condition toobtain a peak shape expression dependent on three adjustable parameters;for each single peak of the plurality of peaks, computing the pluralityof intrinsic peak characteristics using the peak shape expression; andfor each multiple peak portion of the plurality of peaks, breaking themultiple peak portion into at least two descendent peaks; and computingthe plurality of intrinsic peak characteristics for each of thedescendent peaks using the peak shape expression.
 12. The method ofclaim 11 wherein breaking comprises: computing a set of average peakcharacteristics based on analysis of a set of preceding single peaks;calculating a first peak shape parameter and a second peak shapeparameter based on the set of average peak characteristics; for each ofthe descendent peaks, iteratively computing an intrinsic peak positionand an initial height based on the first peak shape parameter, thesecond peak shape parameter, and the average peak characteristics. 13.The method of claim 10 wherein computing a peak resolution comprises:dividing an area under a curve representing the absolute value of thedifference between an apparent peak signal of an apparent peak of theplurality of peaks and a corresponding fitted model peak by an area ofthe apparent peak.
 14. The method of claim 1 wherein creating comprises:refining the plurality of original base calls using the plurality ofintrinsic peak characteristics; inserting a plurality of newly calledbases using the plurality of intrinsic peak characteristics.
 15. Themethod of claim 14 wherein refining comprises: calling true peaks;resolving wide peaks; re-calling unknown bases; and removing unmatchedbases.
 16. The method of claim 14 wherein refining comprises: for eachknown original base call, scanning a peak list of a particular colorbase; if a peak is found at a location of the original base call and thepeak is not assigned, calling the peak and assigning a correspondingbase to the peak; if the peak is found at the location of the originalbase call and the peak has been assigned, determining whether the peakis wide enough to be split; if the peak is wide enough to be split,resolving the peak into two peaks; if the peak is not wide enough to besplit and no good peaks are located near the found peak, calling thepeak and assigning a corresponding base to the peak; and if the peak isnot wide enough to be split and a good peak is located near the foundpeak, converting the found base call to unknown.
 17. The method of claim14 wherein refining comprises: for each unknown base call, scanning fourpeak lists, one peak list for each base; if a peak is found at alocation of an unknown base call, obtaining a best peak; if the bestpeak is not assigned, calling the best peak and assigning acorresponding base to the best peak; if the best peak has been assigned,determining whether the best peak is wide enough to be split; if thebest peak is wide enough to be split, resolving the best peak into twopeaks, calling the two peaks, and assigning corresponding bases to thetwo peaks; if the best peak is not wide enough to be split, assigningthe corresponding base to the best peak; if no peaks are found at thelocation of the unknown base call, locating a good peak near thelocation of the unknown base call; if no good peak is located near thelocation of the unknown base call, rejecting the unknown base call; andif the good peak is located near the location of the unknown base call,calling the good peak and assigning the corresponding base to the peak.18. The method of claim 11 wherein inserting the plurality of newlycalled bases comprises: creating a multi-colored list of all peaks; andcalling those peaks in the multi-colored list of all peaks that meet allof a plurality of criteria.
 19. The method of claim 18 wherein theplurality of criteria comprise: whether an index of an uncalled peak isgreater than a specified minimum; whether an intrinsic height of theuncalled peak is greater than an intrinsic signal of any other peaks atthe position of the uncalled peak; and whether a spacing between twoadjacent called peaks is large enough for insertion of a new base. 20.The method of claim 1 wherein assigning comprises: computing a pluralityof peak trace parameters for each intrinsic peak in the plurality ofrefined base calls; and obtaining a quality value for each base in theplurality of refined base calls from a look-up table.
 21. The method ofclaim 20 wherein the plurality of peak trace parameters comprise: afirst peak height ratio for a current peak based on a first plurality ofcalled peaks centered at a current peak; a second peak height ratio forthe current peak based on a second plurality of called peaks centered atthe current peak; a peak spacing ratio for the current peak based on thesecond plurality of called peaks centered at the current peak; and apeak resolution.
 22. The method of claim 21 wherein the first pluralityof called peaks centered at the current peak comprise three calledpeaks, and wherein the second plurality of called peaks centered at thecurrent peak comprise seven called peaks.
 23. The method of claim 20wherein obtaining comprises: selecting the quality value from thelook-up table from the row in the look-up table in which a plurality oftable trace parameters in the row of the look-up table each exceed theplurality of peak trace parameters for a current peak.
 24. The method ofclaim 1 further comprising: training the system by generating a trainingfile and a look-up table.
 25. The method of claim 24 wherein training isconducted at each of a plurality of user locations based on a pluralityof characteristics and a plurality of requirements of each of theplurality of user locations.
 26. A machine readable medium having storedthereon instructions which when executed by a processor cause themachine to perform operations comprising: processing a plurality ofinformation obtained from a base calling system; creating a plurality ofrefined base calls using a plurality of original base calls and aplurality of intrinsic peak characteristics; and assigning a qualityvalue to each of the plurality of refined base calls using the pluralityof intrinsic peak characteristics.
 27. The machine readable medium ofclaim 26 wherein processing comprises: detecting a plurality of peaks;expanding the plurality of peaks into a plurality of expanded peaks; andresolving the plurality of expanded peaks.
 28. The machine readablemedium of claim 27 wherein detecting comprises: identifying a pluralityof inflection points; and selecting the plurality of peaks based on theplurality of inflection points.
 29. The machine readable medium of claim28 wherein detecting further comprises: computing a plurality ofapparent peak characteristics.
 30. The machine readable medium of claim27 wherein expanding comprises: scanning to the left from a leftinflection point of each peak of the plurality of peaks to determine anexpanded left boundary for the peak; and scanning to the right from aright inflection point of each peak of the plurality of peaks todetermine an expanded right boundary for the peak.
 31. The machinereadable medium of claim 27 wherein resolving comprises: fitting theplurality of expanded peaks using a model of a peak shape; and computinga peak resolution.
 32. The machine readable medium of claim 31 whereinfitting comprises: deriving an equation representing an output of anelectrophoresis of a plurality of DNA fragments, the equation accountingfor electromigration and diffusion; solving the equation using arectangular peak of finite width as an initial condition to obtain apeak shape expression dependent on a set of adjustable parameters; foreach single peak of the plurality of peaks, computing the plurality ofintrinsic peak characteristics using the peak shape expression; and foreach multiple peak portion of the plurality of peaks, breaking themultiple peak portion into at least two descendent peaks; and computingthe plurality of intrinsic peak characteristics for each of thedescendent peaks using the peak shape expression.
 33. The machinereadable medium of claim 32 wherein breaking comprises: computing a setof average peak characteristics based on analysis of a set of precedingsingle peaks; calculating a first peak shape parameter and a second peakshape parameter based on the set of average peak characteristics; foreach of the descendent peaks, iteratively computing an intrinsic peakposition and an initial height based on the first peak shape parameter,the second peak shape parameter, and the average peak characteristics.34. The machine readable medium of claim 31 wherein computing the peakresolution comprises: dividing an area under a curve representing theabsolute value of the difference between an apparent peak signal of anapparent peak of the plurality of peaks and a corresponding fitted modelpeak by an area of the apparent peak.
 35. The machine readable medium ofclaim 26 wherein creating comprises: refining the plurality of originalbase calls using the plurality of intrinsic peak characteristics;inserting a plurality of newly called bases using the plurality ofintrinsic peak characteristics.
 36. The machine readable medium of claim35 wherein refining comprises: calling true peaks; resolving wide peaks;re-calling unknown bases; and removing unmatched bases.
 37. The machinereadable medium of claim 35 wherein refining comprises: for each knownoriginal base call, scanning a peak list of a particular color base; ifa peak is found at a location of the original base call and the peak isnot assigned, calling the peak and assigning a corresponding base to thepeak; if the peak is found at the location of the original base call andthe peak has been assigned, determining whether the peak is wide enoughto be split; if the peak is wide enough to be split, resolving the peakinto two peaks; if the peak is not wide enough to be split and no goodpeaks are located near the found peak, calling the peak and assigning acorresponding base to the peak; and if the peak is not wide enough to besplit and a good peak is located near the found peak, converting thefound base call to unknown.
 38. The machine readable medium of claim 35wherein refining comprises: for each unknown base call, scanning fourpeak lists, one peak list for each base; if a peak is found at alocation of an unknown base call, obtaining a best peak; if the bestpeak is not assigned, calling the best peak and assigning acorresponding base to the best peak; if the best peak has been assigned,determining whether the best peak is wide enough to be split; if thebest peak is wide enough to be split, resolving the best peak into twopeaks, calling the two peaks, and assigning corresponding bases to thetwo peaks; if the best peak is not wide enough to be split, assigningthe corresponding base to the best peak; if no peaks are found at thelocation of the unknown base call, locating a good peak near thelocation of the unknown base call; if no good peak is located near thelocation of the unknown base call, rejecting the unknown base call; andif the good peak is located near the location of the unknown base call,calling the good peak and assigning the corresponding base to the peak.39. The machine readable medium of claim 35 wherein inserting theplurality of newly called bases comprises: creating a multi-colored listof all peaks; and calling those peaks in the multi-colored list of allpeaks that meet all of a plurality of criteria.
 40. The machine readablemedium of claim 39 wherein the plurality of criteria comprise: whetheran index of an uncalled peak is greater than a specified minimum;whether an intrinsic height of the uncalled peak is greater than anintrinsic signal of any other peaks at the position of the uncalledpeak; and whether a spacing between two adjacent called peaks is largeenough for insertion of a new base.
 41. The machine readable medium ofclaim 26 wherein assigning comprises: computing a plurality of peaktrace parameters for each intrinsic peak in the plurality of refinedbase calls; and obtaining a quality value for each base in the pluralityof refined base calls from a look-up table.
 42. The machine readablemedium of claim 41 wherein the plurality of peak trace parameterscomprise: a first peak height ratio for a current peak based on a firstplurality of called peaks centered at a current peak; a second peakheight ratio for the current peak based on a second plurality of calledpeaks centered at the current peak; a peak spacing ratio for the currentpeak based on the second plurality of called peaks centered at thecurrent peak; and a peak resolution.
 43. The machine readable medium ofclaim 41 wherein obtaining comprises: selecting the quality value fromthe look-up table from the row in the look-up table in which a pluralityof table trace parameters in the row of the look-up table each exceedthe plurality of peak trace parameters for a current peak.
 44. Themachine readable medium of claim 26 containing further instructionswhich when executed by a processor cause the machine to performoperations comprising: training the system by generating a training fileand a look-up table.